1 - Cluster comparisons

Author

CDN team

Published

February 27, 2024

Introduction

In this notebook we are going to look at how to interpret and visualize gene-level statistics obtained from differential expression analysis. We are not going to go into which method should be used to carry out differential gene expression analysis but we highly recommend giving a read to A comparison of marker gene selection methods for single-cell RNA sequencing data by Jeffrey M. Pullin & Davis J. McCarthy if you’re interested in digging deeper!

Some other interesting papers and twitter discussions can be found here:

  • Why Seurat and Scanpy’s log fold change calculations are discordant - https://twitter.com/lpachter/status/1694387749967847874?s=46

  • Discrepancies between Seurat and Scanpy’s logFC - https://twitter.com/slavov_n/status/1582347828818456576

  • Differences in wilcoxon rank sum test p-value calculations between Seurat and Scanpy - https://twitter.com/Sanbomics/status/1693995213298266515

Libraries

### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!requireNamespace("SingleCellExperiment", quietly = TRUE))
    BiocManager::install("SingleCellExperiment", update = FALSE)
    

if (!requireNamespace("scran", quietly = TRUE))
    BiocManager::install("scran")

if (!requireNamespace("AnnotationDbi", quietly = TRUE))
    BiocManager::install("AnnotationDbi")

if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
    BiocManager::install("org.Hs.eg.db")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages("RColorBrewer")

if (!requireNamespace("DT", quietly = TRUE))
    install.packages("DT")

### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(SingleCellExperiment)
library(scran)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(colorBlindness)
library(RColorBrewer)
library(DT)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.

# Download the data in data/ directory
download.file(
    url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
    destfile = "../data/workshop-data.rds",
    method = "wget",
    extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds

se <- readRDS("../data/workshop-data.rds")

Generate a color palette for our cell types

# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
# nb.cols <- length(unique(se$Celltype))
# mycolors <- colorRampPalette(paletteMartin)(nb.cols)
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))

Analysis

Set gene symbols

library(AnnotationDbi)
# BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
symbol_id <- mapIds(
    org.Hs.eg.db,
    keys = rownames(se), 
    column = "SYMBOL",
    keytype = "ENSEMBL",
    multiVals = "first")

# df <- data.frame(symbol = symbol_id, ensembl = names(symbol_id))
all(rownames(se) == names(symbol_id))
[1] TRUE
# re-create seurat object
mtx <- se@assays$RNA@data
rownames(mtx) <- symbol_id

# Remove NAs
sum(is.na(rownames(mtx)))
[1] 9564
dim(mtx)
[1] 33234 59572
mtx <- mtx[!is.na(rownames(mtx)), ]
dim(mtx)
[1] 23670 59572
rownames(mtx) <- make.unique(rownames(mtx))
se <- CreateSeuratObject(counts = mtx, meta.data = se@meta.data)

sum(is.na(rownames(se)))
[1] 0

Quick processing

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>%
    RunPCA(verbose = FALSE) %>%
    RunUMAP(dims = 1:30, verbose = FALSE)

ElbowPlot(se, ndims = 50)

DimPlot(
    se,
    group.by = c("Celltype"),
    label = TRUE,
    cols = pal)

Seurat DGE

The different implementations Seurat incorporates provides in FindAllMarkers compare the gene expression between 2 groups of cells. This one vs all strategy is very quick and returns the avg_log2FC. This avg_log2FC is computed as detailed here & here. Since we’re working with normalized data the log2FC can be directly computed by subtracting the average expression between both groups - log(\frac{exp1}{exp2})=log(exp1)-log(exp2)

Idents(se) <- "Celltype"
mgs <- FindAllMarkers(
    se,
    test.use = "wilcox",
    only.pos = TRUE,
    logfc.threshold = 0.25,
    min.pct = 0.25)

Look at the results in a dynamic table:

DT::datatable(mgs, filter = "top")

See below how the avg_log2FC calculation is done! Code extracted from Seurat’s codebase.

features <- rownames(se) == "MS4A1"
cells.1 <- se$Celltype == "B cell, IgG+"
cells.2 <- se$Celltype != "B cell, IgG+"
data.use <- GetAssayData(object = se, assay.type = "RNA", slot = "data")
pseudocount.use <- 1
base <- 2

# Calculate fold change
mean.fxn <- function(x) {
    return(log(x = (rowSums(x = expm1(x = x)) + pseudocount.use)/NCOL(x), base = base))
  }

data.1 <- mean.fxn(data.use[features, cells.1, drop = FALSE])
data.2 <- mean.fxn(data.use[features, cells.2, drop = FALSE])

# Look at log2FC
(fc <- (data.1 - data.2))
   MS4A1 
3.811906 

Check if its equal to the avg_log2FC obtained from FindAllMarkers:

fc == mgs[mgs$cluster == "B cell, IgG+" & mgs$gene == "MS4A1", "avg_log2FC"]
MS4A1 
 TRUE 
Looking into the P-values

More details can be obtained in OSCA.

P values obtained from DGE analysis are inflated and, therefore invalid in their interpretation. We can’t use p-values to reject the Null Hypothesis since we are carrying out data snooping. This means that we are dividing the clusters based on their gene expression, and then computing p-values from the genes that are differentially expressed, even though we already know these genes are differentially expressed since we clustered the data based on them being different.

A way to show this is by looking at how skewed the distributions of the p-values obtained is:

# Compute the p-values without he thresholds
mgs2 <- FindAllMarkers(
    se,
    test.use = "wilcox",
    only.pos = TRUE,
    logfc.threshold = 0,
    min.pct = 0,
    return.thresh = 1,
    max.cells.per.ident = 100 # use 100 cells per cell type for speed
    )

ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
    # geom_histogram(alpha = 0.3, position = "identity") +
    geom_density(alpha = 0.3) +
    theme_minimal()

ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
    geom_histogram(alpha = 0.3, position = "identity") +
    facet_wrap(~cluster, scales = "free") +
    theme_minimal()

Scran DGE

Dig deeper in Orchestrating Single Cell Analysis with Bioconductor book here & here

scoreMarkers - p-values for these types of comparisons are largely meaningless; individual cells are not meaningful units of experimental replication, while the groups themselves are defined from the data. Thus, by discarding the p-values, we can simplify our marker selection by focusing only on the effect sizes between groups.

Here, the strategy is to perform pairwise comparisons between each pair of groups to obtain various effect sizes. For each group X, we summarize the effect sizes across all pairwise comparisons involving that group, e.g., mean, min, max and so on. This yields a DataFrame for each group where each column contains a different summarized effect and each row corresponds to a gene in x.

# Convert to single cell experiment
(sce <- as.SingleCellExperiment(se))
class: SingleCellExperiment 
dim: 23670 59572 
metadata(0):
assays(2): counts logcounts
rownames(23670): TSPAN6 TNMD ... LINC03094 MKKS.1
rowData names(0):
colnames(59572): AAACCCAAGAATGTTG-12 AAACCCAAGCATTGTC-19 ... TTTGTTGTCTGCTTAT-20 TTTGTTGTCTGGGATT-19
colData names(50): orig.ident nCount_RNA ... observation_joinid ident
reducedDimNames(2): PCA UMAP
mainExpName: RNA
altExpNames(0):
markers <- scoreMarkers(sce, groups = sce$Celltype)
markers[["B cell, IgG+"]] %>%
    data.frame() %>%
    filter(self.detected > 0.25) %>%
    DT::datatable(filter = "top")
markers_sub <- lapply(names(markers), function(i) {
    m <- markers[[i]]
    m$celltype <- i
    m$gene <- rownames(m)
    
    m %>%
        as_tibble() %>%
        filter(self.detected > 0.25) %>% 
        dplyr::select(gene, celltype, everything())
}) %>% bind_rows()
Example: B cell, IgG+

Let’s look at CD79A and CD74 as example genes obtained for B cell, IgG+:

markers_sub %>%
    dplyr::filter(
        celltype == "B cell, IgG+" & gene %in% c("CD79A", "CD74")) %>%
    DT::datatable()

Right away we can see how CD79A and CD74 are highly expressed (self.average) and have ubiquitous expression across all cells in B cell, IgG+ cells (self.detected). Differences occur in the other groups. CD74 is pretty much expressed at varying degrees across all other cell types at lower intensity, except in DCs, other.average = 2.2 & other.detected = 0.76. CD79A, in turn is pretty much only expressed in the Uncategorized2 and B cell, IgG- populations other.average = 0.4 & other.detected = 0.18. These patterns of expression lead to differences at the gene-level statistics such as AUC and logFC. In this case, AUC similar between both groups affected due to both their higher relative expressions when compared to the other populations. CD74 has a mean AUC of 0.88 and CD79A’s is 0.94. However, big differences arise in terms of logFC since they have a mean.logFC of 0.6 and 4.25 respectively. Therefore, by looking at these 2 parameters simultaneously we can get a good understanding at how specifically that marker is expressed in that population.

Look at the expression of CD79A and CD74 expression across groups with a violin plot:

VlnPlot(
    se,
    features = c("CD79A", "CD74"),
    group.by = "Celltype",
    cols = pal)

We can also plot all the genes as a function of their mean.AUC and mean.logFC for a quick view:

markers[["B cell, IgG+"]] %>%
    data.frame() %>% 
    tibble::rownames_to_column("gene") %>% 
    mutate(txt = case_when(
        abs(mean.AUC) > 0.75 & abs(mean.logFC.detected) > 3 ~ gene,
        TRUE ~ NA_character_
    )) %>% 
    ggplot(aes(
        x = mean.AUC,
        y = mean.logFC.detected,
        size = self.average,
        color = mean.logFC.detected,
        label = txt)) +
    geom_point() +
    ggrepel::geom_text_repel(color = "black") +
    labs(
        title = "Differential expression gene-level statistics",
        x = "mean AUC",
        y = "mean logFC",
        color = "mean logFC",
        size = "Average expression\nin self") +
    theme_classic() +
    scale_color_distiller(palette = "Spectral")

Example : CD4, EM-like

Next we’ll look at some key genes for CD4, EM-like:

markers_sub %>%
    dplyr::filter(
        celltype == "CD4, EM-like" & gene %in% c("CD4", "TRBC1", "TRBC2")) %>%
    DT::datatable()
VlnPlot(
    se,
    features = c("CD4", "TRBC1", "TRBC2"),
    group.by = "Celltype",
    cols = pal)

CD4 and TRBC1/2 have similar statistics but they come from comparing against very different subpopulations. - CD4 has been identified as a lowly expressed gene, which makes it hard to capture with scRNAseq. We see how it has a mean AUC of 0.55 and mean logFC of 2.12. These statistics are due to its simultaneous expression in monocytes as reported by Filion et al and DCs Patterson et al. TRBC1/2 in turn, show similar statistics, in this case because of their ubiquitous expression across T cell populatoins.

markers[["CD4, EM-like"]] %>%
    data.frame() %>% 
    tibble::rownames_to_column("gene") %>% 
    mutate(txt = case_when(
        abs(mean.AUC) > 0.65 & abs(mean.logFC.detected) > 1.5 ~ gene,
        TRUE ~ NA_character_
    )) %>% 
    ggplot(aes(
        x = mean.AUC,
        y = mean.logFC.detected,
        size = self.average,
        color = mean.logFC.detected,
        label = txt)) +
    geom_point() +
    ggrepel::geom_text_repel(color = "black") +
    labs(
        title = "Differential expression gene-level statistics",
        x = "mean AUC",
        y = "mean logFC",
        color = "mean logFC",
        size = "Average expression\nin self") +
    theme_classic() +
    scale_color_distiller(palette = "Spectral")

We can also compute the AUCs in Seurat as follows:

mgs_roc <- FindAllMarkers(
    se,
    test.use = "roc",
    only.pos = TRUE,
    logfc.threshold = 0.25,
    min.pct = 0.25)
mgs_roc

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DT_0.29                     RColorBrewer_1.1-3          colorBlindness_0.1.9        org.Hs.eg.db_3.17.0         AnnotationDbi_1.62.2        scran_1.28.2                scuttle_1.9.4               SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2 Biobase_2.60.0              GenomicRanges_1.52.0        GenomeInfoDb_1.36.3         IRanges_2.34.1              S4Vectors_0.38.1            BiocGenerics_0.46.0         MatrixGenerics_1.12.3       matrixStats_1.0.0           lubridate_1.9.2             forcats_1.0.0               stringr_1.5.0               dplyr_1.1.3                 purrr_1.0.2                 readr_2.1.4                 tidyr_1.3.0                 tibble_3.2.1                ggplot2_3.4.3               tidyverse_2.0.0             Seurat_5.0.0                SeuratObject_5.0.0          sp_2.0-0                   

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.21          splines_4.3.1             later_1.3.1               bitops_1.0-7              polyclip_1.10-4           fastDummies_1.7.3         lifecycle_1.0.3           edgeR_3.42.4              globals_0.16.2            lattice_0.21-8            MASS_7.3-60               crosstalk_1.2.0           magrittr_2.0.3            sass_0.4.7                limma_3.56.2              plotly_4.10.2             rmarkdown_2.24            jquerylib_0.1.4           yaml_2.3.7                metapod_1.7.0             httpuv_1.6.11             sctransform_0.4.1         spam_2.10-0               spatstat.sparse_3.0-2     reticulate_1.32.0.9002    DBI_1.1.3                 cowplot_1.1.1             pbapply_1.7-2             abind_1.4-5               zlibbioc_1.46.0           Rtsne_0.16                presto_1.0.0              RCurl_1.98-1.12           GenomeInfoDbData_1.2.10   ggrepel_0.9.3             irlba_2.3.5.1             listenv_0.9.0             spatstat.utils_3.0-3      goftest_1.2-3             RSpectra_0.16-1           spatstat.random_3.1-5     dqrng_0.3.1               fitdistrplus_1.1-11       parallelly_1.36.0         DelayedMatrixStats_1.22.6
 [46] leiden_0.4.3              codetools_0.2-19          DelayedArray_0.26.7       tidyselect_1.2.0          farver_2.1.1              ScaledMatrix_1.8.1        spatstat.explore_3.2-1    jsonlite_1.8.7            BiocNeighbors_1.18.0      ellipsis_0.3.2            progressr_0.14.0          ggridges_0.5.4            survival_3.5-7            tools_4.3.1               ica_1.0-3                 Rcpp_1.0.11               glue_1.6.2                gridExtra_2.3             xfun_0.40                 withr_2.5.0               BiocManager_1.30.22       fastmap_1.1.1             bluster_1.10.0            fansi_1.0.4               digest_0.6.33             rsvd_1.0.5                timechange_0.2.0          gridGraphics_0.5-1        R6_2.5.1                  mime_0.12                 colorspace_2.1-0          scattermore_1.2           tensor_1.5                RSQLite_2.3.1             spatstat.data_3.0-1       utf8_1.2.3                generics_0.1.3            data.table_1.14.8         httr_1.4.7                htmlwidgets_1.6.2         S4Arrays_1.2.0            uwot_0.1.16               pkgconfig_2.0.3           gtable_0.3.4              blob_1.2.4               
 [91] lmtest_0.9-40             XVector_0.40.0            htmltools_0.5.6           dotCall64_1.1-0           scales_1.2.1              png_0.1-8                 knitr_1.45                rstudioapi_0.15.0         tzdb_0.4.0                reshape2_1.4.4            nlme_3.1-163              cachem_1.0.8              zoo_1.8-12                KernSmooth_2.23-22        vipor_0.4.5               parallel_4.3.1            miniUI_0.1.1.1            ggrastr_1.0.2             pillar_1.9.0              grid_4.3.1                vctrs_0.6.3               RANN_2.6.1                promises_1.2.1            BiocSingular_1.16.0       beachmat_2.16.0           xtable_1.8-4              cluster_2.1.4             beeswarm_0.4.0            evaluate_0.21             cli_3.6.1                 locfit_1.5-9.8            compiler_4.3.1            rlang_1.1.1               crayon_1.5.2              future.apply_1.11.0       labeling_0.4.3            ggbeeswarm_0.7.2          plyr_1.8.8                stringi_1.7.12            viridisLite_0.4.2         deldir_1.0-9              BiocParallel_1.34.2       Biostrings_2.68.1         munsell_0.5.0             lazyeval_0.2.2           
[136] spatstat.geom_3.2-4       Matrix_1.6-1              RcppHNSW_0.4.1            hms_1.1.3                 patchwork_1.1.3           bit64_4.0.5               sparseMatrixStats_1.12.2  future_1.33.0             KEGGREST_1.40.0           statmod_1.5.0             shiny_1.7.5               ROCR_1.0-11               memoise_2.0.1             igraph_1.5.1              bslib_0.5.1               bit_4.0.5